Document Representation Using Global Association Distance Model

Authors

  • José Eladio Medina-Pagola
  • Ansel Y. Rodríguez González
  • Abdel Hechavarría Díaz
  • José Hernández Palancar
Abstract

Text information processing depends critically on the proper representation of documents. Traditional models, such as the vector space model, have significant limitations because they do not consider semantic relations among terms. In this paper we analyze a document representation based on the association graph scheme and present a new approach called the Global Association Distance Model (GADM). Finally, we compare GADM with the classical vector space model and the association graph model, using a K-NN classifier.

Similar resources

Formal Distance vs. Association Strength in Text Processing

Text information processing depends critically on proper document representation. Traditional models, like the vector space model, have significant limitations because they do not consider semantic relations among terms. In this paper we analyze a document representation that uses an association graph scheme model called the Global Association Distance Model, or GADM, the significance of the formal...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing semantic concepts of text motivated researchers to use...

Content-based hierarchical document organization using multi-layer hybrid network and tree-structured features

Automatically organizing documents in a hierarchical tree is in demand in many real applications. In this work, we focus on the problem of content-based document organization through a hierarchical tree, which can be viewed as a classification problem. We proposed a new document representation to enhance the classification accuracy. We developed a new hybrid neural network model to handle the n...

Oscillating Feature Subset Search Algorithm for Text Categorization

A major characteristic of text document categorization problems is the extremely high dimensionality of text data. In this paper we explore the usability of the Oscillating Search algorithm for feature/word selection in text categorization. We propose to use the multiclass Bhattacharyya distance for multinomial model as the global feature subset selection criterion for reducing the dimensionali...

Publication date: 2007